
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <title>Stop Words &#8212; PyCantonese 2.2.0 documentation</title>
    <link rel="stylesheet" href="_static/sphinxdoc.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    <script type="text/javascript" id="documentation_options" data-url_root="./" src="_static/documentation_options.js"></script>
    <script type="text/javascript" src="_static/jquery.js"></script>
    <script type="text/javascript" src="_static/underscore.js"></script>
    <script type="text/javascript" src="_static/doctools.js"></script>
    <link rel="index" title="Index" href="genindex.html" />
    <link rel="search" title="Search" href="search.html" />
    <link rel="next" title="Corpus Reader Methods" href="reader.html" />
    <link rel="prev" title="Corpus Data" href="data.html" /> 
  </head><body>
    <div class="related" role="navigation" aria-label="related navigation">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="genindex.html" title="General Index"
             accesskey="I">index</a></li>
        <li class="right" >
          <a href="reader.html" title="Corpus Reader Methods"
             accesskey="N">next</a> |</li>
        <li class="right" >
          <a href="data.html" title="Corpus Data"
             accesskey="P">previous</a> |</li>
        <li class="nav-item nav-item-0"><a href="index.html">PyCantonese 2.2.0 documentation</a> &#187;</li> 
      </ul>
    </div>
      <div class="sphinxsidebar" role="navigation" aria-label="main navigation">
        <div class="sphinxsidebarwrapper">
  <h4>Previous topic</h4>
  <p class="topless"><a href="data.html"
                        title="previous chapter">Corpus Data</a></p>
  <h4>Next topic</h4>
  <p class="topless"><a href="reader.html"
                        title="next chapter">Corpus Reader Methods</a></p>
        </div>
      </div>

    <div class="document">
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body" role="main">
            
  <div class="section" id="stop-words">
<span id="id1"></span><h1>Stop Words<a class="headerlink" href="#stop-words" title="Permalink to this headline">¶</a></h1>
<p>Many natural language processing tasks require filtering out stop words,
i.e., very frequent words that carry little content. English examples include
function words such as pronouns and determiners. PyCantonese provides the function <code class="docutils literal notranslate"><span class="pre">stop_words()</span></code>,
which returns a set of about 100 Cantonese stop words:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">pycantonese</span> <span class="kn">as</span> <span class="nn">pc</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">stop_words</span> <span class="o">=</span> <span class="n">pc</span><span class="o">.</span><span class="n">stop_words</span><span class="p">()</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">len</span><span class="p">(</span><span class="n">stop_words</span><span class="p">)</span>
<span class="go">104</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">stop_words</span>
<span class="go">{&#39;一啲&#39;, &#39;一定&#39;, &#39;不如&#39;, &#39;不過&#39;, ...}</span>
</pre></div>
</div>
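<p>As a minimal sketch of what filtering looks like in practice, the snippet below drops stop words from a list of tokens.
The small set is only a stand-in holding the four entries visible in the output above, and the token list is made up for illustration.
With PyCantonese installed, the set returned by <code class="docutils literal notranslate"><span class="pre">pc.stop_words()</span></code> would take its place.</p>

```python
# Stand-in for pc.stop_words(): only the four entries visible in the output above.
stop_words = {'一啲', '一定', '不如', '不過'}

# Hypothetical, already segmented tokens (not taken from any corpus).
tokens = ['不過', '香港', '一定', '落雨']

# Keep only the tokens that are not stop words, preserving their order.
content_words = [token for token in tokens if token not in stop_words]
print(content_words)  # ['香港', '落雨']
```

<p>Because the stop words come as a set, each membership test is a constant-time lookup on average, so the filter stays cheap even for long token lists.</p>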
<p>Depending on your use case, you may want to add stop words to the default set
or remove some from it.
For this purpose, the <code class="docutils literal notranslate"><span class="pre">stop_words()</span></code> function takes the optional arguments <code class="docutils literal notranslate"><span class="pre">add</span></code> and
<code class="docutils literal notranslate"><span class="pre">remove</span></code>.</p>
<p><code class="docutils literal notranslate"><span class="pre">add</span></code> can be either a string (e.g., treat <code class="docutils literal notranslate"><span class="pre">'香港'</span></code> as a stop word if your
data is all about Hong Kong) or an iterable of strings:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre><span></span><span class="gp">&gt;&gt;&gt; </span><span class="kn">import</span> <span class="nn">pycantonese</span> <span class="kn">as</span> <span class="nn">pc</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">stop_words_1</span> <span class="o">=</span> <span class="n">pc</span><span class="o">.</span><span class="n">stop_words</span><span class="p">(</span><span class="n">add</span><span class="o">=</span><span class="s1">&#39;香港&#39;</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">len</span><span class="p">(</span><span class="n">stop_words_1</span><span class="p">)</span>
<span class="go">105</span>
<span class="gp">&gt;&gt;&gt; </span><span class="s1">&#39;香港&#39;</span> <span class="ow">in</span> <span class="n">stop_words_1</span>
<span class="go">True</span>
<span class="gp">&gt;&gt;&gt; </span><span class="n">stop_words_2</span> <span class="o">=</span> <span class="n">pc</span><span class="o">.</span><span class="n">stop_words</span><span class="p">(</span><span class="n">add</span><span class="o">=</span><span class="p">[</span><span class="s1">&#39;香港島&#39;</span><span class="p">,</span> <span class="s1">&#39;九龍&#39;</span><span class="p">,</span> <span class="s1">&#39;新界&#39;</span><span class="p">])</span>
<span class="gp">&gt;&gt;&gt; </span><span class="nb">len</span><span class="p">(</span><span class="n">stop_words_2</span><span class="p">)</span>
<span class="gp">&gt;&gt;&gt; </span><span class="mi">107</span>
<span class="gp">&gt;&gt;&gt; </span><span class="p">{</span><span class="s1">&#39;香港島&#39;</span><span class="p">,</span> <span class="s1">&#39;九龍&#39;</span><span class="p">,</span> <span class="s1">&#39;新界&#39;</span><span class="p">}</span><span class="o">.</span><span class="n">issubset</span><span class="p">(</span><span class="n">stop_words_2</span><span class="p">)</span>
<span class="go">True</span>
</pre></div>
</div>
<p>Similarly, the <code class="docutils literal notranslate"><span class="pre">remove</span></code> argument takes either a string or an iterable
of strings.</p>
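<p>One way to picture how both arguments behave is the sketch below.
The function <code class="docutils literal notranslate"><span class="pre">customize()</span></code> and its helper <code class="docutils literal notranslate"><span class="pre">_as_set()</span></code> are hypothetical and are not PyCantonese's implementation.
They only illustrate the string-or-iterable handling described above, where a bare string counts as one word rather than as an iterable of its characters.
The base set is a stand-in holding just the four default entries shown earlier.</p>

```python
def _as_set(words):
    """Normalize None, a single string, or an iterable of strings to a set."""
    if words is None:
        return set()
    if isinstance(words, str):
        # A bare string is one word, not an iterable of its characters.
        return {words}
    return set(words)


def customize(base, add=None, remove=None):
    """Return a new set: base plus `add`, minus `remove` (hypothetical helper)."""
    return (set(base) | _as_set(add)) - _as_set(remove)


# Stand-in for the default stop words: only the four entries shown earlier.
base = {'一啲', '一定', '不如', '不過'}

without_one = customize(base, remove='一啲')            # a single string
without_two = customize(base, remove=['一定', '不如'])  # an iterable of strings

print(len(without_one))                 # 3
print('一啲' in without_one)             # False
print(without_two == {'一啲', '不過'})   # True
```

<p>With PyCantonese itself, the corresponding call would be <code class="docutils literal notranslate"><span class="pre">pc.stop_words(remove='一啲')</span></code>, which should return the default set without that word.</p>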
</div>


          </div>
        </div>
      </div>
      <div class="clearer"></div>
    </div>
    <div class="footer" role="contentinfo">
        &#169; Copyright 2014-2018, Jackson L. Lee | Documentation last updated on June 30, 2018.
      Created using <a href="http://sphinx-doc.org/">Sphinx</a> 1.7.5.
    </div>
  </body>
</html>